The most common way to access a subset of the data in a dataframe is through the “.loc” attribute. This attribute uses square brackets instead of parentheses and contains two arguments: the first one tells Python which rows you want, and the second one tells it which columns you want, which generally looks like this:
DataFrame.loc[<which_rows>, <which_columns>]
where, instead of DataFrame you wouls use the name of the full dataframe you want to subset. Pandas allows for a lot of flexibility as to what you can use instead of <which_rows> and <which_column> above. See the examples in the official documentation to get a more complete picture of what is possible with .loc[]: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
Below, I cover the most common ways to specify which rows and columns you want:
By explicitly specifying the names (labels) of the index and/or column(s)
Using slices (ranges) on the index labels or common labels
Anything that returns a boolean sequence (True will be interpreted as “I want this row/column”)
Slicing also works, but this time we have to use index/column numbers, and the right-most end of the range is not included:
df.iloc[1:2, 0:2]
ticker
price
Apple
AAPL
2000
We rarely use boolean arrays with .iloc() so we will not cover it here.
The “.filter()” attribute
The .filter() attribute comes in handy if we want to subset based on index or column names (labels). In particular, its like parameter allows us to specify that we want all rows/columns that contain a particular piece of text in their label.
Let’s make a copy of df that we can safely change for this section:
newdf = dfnewdf
ticker
price
return
Tesla
TSLA
1000
0.10
Apple
AAPL
2000
0.05
Microsoft
MSFT
100
0.07
Many times, we want to store a subset of a dataframe inside a new dataframe. For example:
sub = newdf.loc[:,'price']sub
Tesla 1000
Apple 2000
Microsoft 100
Name: price, dtype: int64
Now suppose we have to make a change to the larger dataframe newdf. For example:
newdf.loc['Tesla','price'] =0newdf
ticker
price
return
Tesla
TSLA
0
0.10
Apple
AAPL
2000
0.05
Microsoft
MSFT
100
0.07
This change will be passed to sub, even tough we never created this change explicitly ourselves:
sub
Tesla 0
Apple 2000
Microsoft 100
Name: price, dtype: int64
This happened because, when we created sub with the command sub = newdf.loc[:,'price'], Pyhton did not actually create an entirely new dataframe. Instead, it just returned something like an address of where in newdf the price data can be found. This is called a view of the data.
This is done to preserve memory and speed up the code, but, like we saw above, it can cause some of our dataframes change when we edit other dataframes.
To avoid this possible problem, I recommend always telling Python to create a copy of the subset of data you want, using the .copy() attribute. In our example above, sub should have been created like this:
sub = newdf.loc[:,'price'].copy()sub
Tesla 0
Apple 2000
Microsoft 100
Name: price, dtype: int64
Now, changes to newdf, like this:
newdf.loc['Apple','price'] =123newdf
ticker
price
return
Tesla
TSLA
0
0.10
Apple
AAPL
123
0.05
Microsoft
MSFT
100
0.07
Will not cause sub to change:
sub
Tesla 0
Apple 2000
Microsoft 100
Name: price, dtype: int64
Filtering on a MultiIndex
So far, the all the dataframes we’ve seen have had a one-dimensional index (a signle column). Dataframes can have a higher-dimensional index, and when they do, Pandas calls that index a MultiIndex (for a more thorough tutorial on MultiIndex, see https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)
Let’s create an example dataframe with a MultiIndex. Below, we create the MultiIndex using the pd.MultiIndex.from_product() function, but there are several other ways of creating one (see the link above).
m = pd.DataFrame(data = np.random.rand(9,3), columns =list('ABC'), index = pd.MultiIndex.from_product([['AAPL','TSLA','MSFT'], [2007,2008,2009]], names = ['ticker','year'] ) )m
Each entry (row) in the index is a tuple (note the parentheses):
m.index[0]
('AAPL', 2007)
This means that if we want to use index labels with .loc to extract some subset of a dataframe, we need to use a tuple when we specify those index labels:
m.loc[('AAPL',2007), :]
A 0.116294
B 0.686215
C 0.323383
Name: (AAPL, 2007), dtype: float64
We can use slices on each dimension of the index, though you should make sure that you sorted your data by the values in the index first (using the sort_index() function:
m = m.sort_index()m
A
B
C
ticker
year
AAPL
2007
0.116294
0.686215
0.323383
2008
0.343762
0.697231
0.670671
2009
0.419409
0.657433
0.987052
MSFT
2007
0.006004
0.500773
0.291589
2008
0.861580
0.300969
0.346424
2009
0.693928
0.514345
0.973823
TSLA
2007
0.994865
0.463342
0.093536
2008
0.762471
0.623193
0.221052
2009
0.439225
0.659502
0.111453
I recommend using the slice() to slide on each dimension of the index. Below, slice(None) means no condition should be imposed on that dimension of the index: